We describe and experimentally evaluate a method for
automatically clustering words according to their dis- 
tribution in particular syntactic contexts. Determinis- 
tic annealing is used to find lowest distortion sets of 
clusters. As the annealing parameter increases, exist- 
ing clusters become unstable and subdivide, yielding a 
hierarchical "soft" clustering of the data. Clusters are 
used as the basis for class models of word cooccurrence,
and the models evaluated with respect to held-out test 
data. 

INTRODUCTION 

Methods for automatically classifying words according 
to their contexts of use have both scientific and prac- 
tical interest. The scientific questions arise in connec- 
tion to distributional views of linguistic (particularly 
lexical) structure and also in relation to the question 
of lexical acquisition both from psychological and com- 
putational learning perspectives. From the practical 
point of view, word classification addresses questions
of data sparseness and generalization in statistical language
models, particularly models for deciding among
alternative analyses proposed by a grammar. 

It is well known that a simple tabulation of frequen- 
cies of certain words participating in certain configura- 
tions, for example of frequencies of pairs of a transitive 
main verb and the head noun of its direct object, can- 
not be reliably used for comparing the likelihoods of dif- 
ferent alternative configurations. The problem is that 
for large enough corpora the number of possible joint 
events is much larger than the number of event occur- 
rences in the corpus, so many events are seen rarely 
or never, making their frequency counts unreliable es- 
timates of their probabilities. 



Hindle (1990) proposed dealing with the sparseness
problem by estimating the likelihood of unseen events
from that of "similar" events that have been seen. For
instance, one may estimate the likelihood of a particular
direct object for a verb from the likelihoods of that direct
object for similar verbs. This requires a reasonable
definition of verb similarity and a similarity estimation 
method. In Hindle's proposal, words are similar if we 
have strong statistical evidence that they tend to par- 
ticipate in the same events. His notion of similarity 
seems to agree with our intuitions in many cases, but 
it is not clear how it can be used directly to construct 
word classes and corresponding models of association. 

Our research addresses some of the same questions 
and uses similar raw data, but we investigate how to 
factor word association tendencies into associations of 
words to certain hidden sense classes and associations
between the classes themselves. While it may be worth- 
while to base such a model on preexisting sense classes 
(Resnik, 1992), in the work described here we look at 
how to derive the classes directly from distributional 
data. More specifically, we model senses as probabilis- 
tic concepts or clusters c with corresponding cluster 
membership probabilities $p(c|w)$ for each word $w$. Most
other class-based modeling techniques for natural language
rely instead on "hard" Boolean classes (Brown
et al., 1990). Class construction is then combinatorially
very demanding and depends on frequency counts
for joint events involving particular words, a potentially 
unreliable source of information as we noted above. Our 
approach avoids both problems. 

Problem Setting 

In what follows, we will consider two major word
classes, $\mathcal{V}$ and $\mathcal{N}$, for the verbs and nouns in our
experiments, and a single relation between them, in our
experiments the relation between a transitive main verb and
the head noun of its direct object. Our raw knowledge
about the relation consists of the frequencies $f_{vn}$
of occurrence of particular pairs $(v, n)$ in the required
configuration in a training corpus. Some form of text
analysis is required to collect such a set of pairs.
The corpus used in our first experiment was derived 
from newswire text automatically parsed by Hindle's
parser Fidditch (Hindle, 1993). More recently, we have
constructed similar tables with the help of a statistical
part-of-speech tagger (Church, 1988) and of tools
for regular expression pattern matching on tagged corpora
(Yarowsky, 1992). We have not yet compared the
accuracy and coverage of the two methods, or what sys- 
tematic biases they might introduce, although we took 
care to filter out certain systematic errors, for instance 
the misparsing of the subject of a complement clause 
as the direct object of a main verb for report verbs like 
"say". 

We will consider here only the problem of classifying 
nouns according to their distribution as direct objects 
of verbs; the converse problem is formally similar. More 
generally, the theoretical basis for our method supports 
the use of clustering to build models for any n-ary rela- 
tion in terms of associations between elements in each 
coordinate and appropriate hidden units (cluster cen- 
troids) and associations between those hidden units. 

For the noun classification problem, the empirical distribution
of a noun $n$ is then given by the conditional
density $p_n(v) = f_{vn} / \sum_{v} f_{vn}$. The problem we study
is how to use the $p_n$ to classify the $n \in \mathcal{N}$. Our classification
method will construct a set $C$ of clusters and
cluster membership probabilities $p(c|n)$. Each cluster $c$
is associated to a cluster centroid $p_c$, which is a discrete
density over $\mathcal{V}$ obtained by averaging appropriately the
$p_n$.
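As an illustration of the definition above, the following minimal sketch computes the empirical conditional distributions $p_n(v)$ from a table of pair frequencies. The function name `empirical_distributions` and the toy counts are assumptions made for this example; they are not data or code from our experiments.

```python
from collections import defaultdict

def empirical_distributions(pair_counts):
    """Map each noun n to p_n(v) = f_vn / sum_v f_vn.

    pair_counts maps (verb, noun) pairs to their frequencies f_vn.
    """
    totals = defaultdict(float)
    for (v, n), f in pair_counts.items():
        totals[n] += f
    dists = defaultdict(dict)
    for (v, n), f in pair_counts.items():
        if f > 0:  # unseen pairs simply get no entry
            dists[n][v] = f / totals[n]
    return dict(dists)

# Hypothetical toy counts for (verb, noun) pairs.
counts = {
    ("fire", "gun"): 3, ("load", "gun"): 1,
    ("fire", "employee"): 2, ("hire", "employee"): 6,
}
p = empirical_distributions(counts)
```

Each noun is then represented by a sparse dictionary over verbs, which is the form assumed by the other sketches in this paper.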

Distributional Similarity 

To cluster nouns $n$ according to their conditional verb
distributions $p_n$, we need a measure of similarity between
distributions. We use for this purpose the relative
entropy or Kullback-Leibler (KL) distance between
two distributions

$$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$



This is a natural choice for a variety of reasons, which
we will just sketch here. [1]

First of all, $D(p \| q)$ is zero just in case $p = q$, and it
increases as the probability decreases that $p$ is the relative
frequency distribution of a random sample drawn
according to $q$. More formally, the probability mass
given by $q$ to the set of all samples of length $n$ with relative
frequency distribution $p$ is bounded by $2^{-nD(p \| q)}$
(Cover and Thomas, 1991). Therefore, if we are trying
to distinguish among hypotheses $q_i$ when $p$ is the relative
frequency distribution of observations, $D(p \| q_i)$
gives the relative weight of evidence in favor of $q_i$. Furthermore,
a similar relation holds between $D(p \| p')$ for
two empirical distributions $p$ and $p'$ and the probability
that $p$ and $p'$ are drawn from the same distribution $q$.
We can thus use the relative entropy between the context
distributions for two words to measure how likely
they are to be instances of the same cluster centroid.

[1] A more formal discussion will appear in our paper
Distributional Clustering, in preparation.
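The relative entropy used here can be computed directly from its definition. The sketch below is illustrative only: the name `kl_divergence` is hypothetical, distributions are assumed to be dictionaries from outcomes to probabilities, and logarithms are taken in base 2 so that the result is in bits.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for distributions given as dicts outcome -> probability.

    Returns infinity when q assigns zero mass to an outcome that p supports,
    which is the case in which the relative entropy is undefined (unbounded).
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total
```

For example, `kl_divergence(p, p)` is zero for any distribution `p`, and the value grows as `q` moves away from `p`.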

From an information theoretic perspective $D(p \| q)$
measures how inefficient on average it would be to use
a code based on $q$ to encode a variable distributed according
to $p$. With respect to our problem, $D(p_n \| p_c)$
thus gives us the loss of information in using cluster
centroid $p_c$ instead of the actual distribution $p_n$ for
word $n$ when modeling the distributional properties of $n$.

Finally, relative entropy is a natural measure of sim- 
ilarity between distributions for clustering because its 
minimization leads to cluster centroids that are a simple 
weighted average of member distributions. 

One technical difficulty is that $D(p \| p')$ is not defined
when $p'(x) = 0$ but $p(x) > 0$. We could sidestep
this problem (as we did initially) by smoothing zero fre- 
quencies appropriately (Church and Gale, 1991). How- 
ever, this is not very satisfactory because one of the 
goals of our work is precisely to avoid the problems of 
data sparseness by grouping words into classes. It turns 
out that the problem is avoided by our clustering tech- 
nique, since it does not need to compute the KL dis- 
tance between individual word distributions, but only 
between a word distribution and average distributions, 
the current cluster centroids, which are guaranteed to 
be nonzero whenever the word distributions are. This 
is a useful advantage of our method compared with ag- 
glomerative clustering techniques that need to compare 
individual objects being considered for grouping. 

THEORETICAL BASIS 

In general, we are interested in how to organize a set
of linguistic objects such as words according to the con- 
texts in which they occur, for instance grammatical con- 
structions or n-grams. We will show elsewhere that the 
theoretical analysis outlined here applies to that more 
general problem, but for now we will only address the 
more specific problem in which the objects are nouns 
and the contexts are verbs that take the nouns as direct 
objects. 

Our problem can be seen as that of learning a joint 
distribution of pairs from a large sample of pairs. The 
pair coordinates come from two large sets $\mathcal{N}$ and $\mathcal{V}$,
with no preexisting topological or metric structure, and
the training data is a sequence $S$ of $N$ independently
drawn pairs

$$S_i = (n_i, v_i), \quad 1 \le i \le N .$$



From a learning perspective, this problem falls somewhere
in between unsupervised and supervised learning.
As in unsupervised learning, the goal is to learn
the underlying distribution of the data. But in contrast
to most unsupervised learning settings, the objects involved
have no internal structure or attributes allowing
them to be compared with each other. Instead, the only
information about the objects is the statistics of their
joint appearance. These statistics can thus be seen as a
weak form of object labelling analogous to supervision.

Distributional Clustering 

While clusters based on distributional similarity are in- 
teresting on their own, they can also be profitably seen 
as a means of summarizing a joint distribution. In par- 
ticular, we would like to find a set of clusters C such 
that each conditional distribution $p_n(v)$ can be approximately
decomposed as

$$\hat{p}_n(v) = \sum_{c \in C} p(c|n)\, p_c(v)$$

where $p(c|n)$ is the membership probability of $n$ in $c$
and $p_c(v) = p(v|c)$ is $v$'s conditional probability given
by the centroid distribution for cluster $c$.

The above decomposition can be written in a more
symmetric form as

$$\hat{p}(n, v) = \sum_{c \in C} p(c, n)\, p(v|c) = \sum_{c \in C} p(c)\, p(n|c)\, p(v|c) \tag{1}$$

assuming that $p(n)$ and $\hat{p}(n)$ coincide. We will take
(1) as our basic clustering model.

To determine this decomposition we need to solve the
two connected problems of finding suitable forms
for the cluster membership $p(c|n)$ and centroid distributions
$p(v|c)$, and of maximizing the goodness of fit between
the model distribution $\hat{p}(n, v)$ and the observed data.

Goodness of fit is determined by the model's likelihood
of the observations. The maximum likelihood
(ML) estimation principle is thus the natural tool to
determine the centroid distributions $p_c(v)$.

As for the membership probabilities, they must be
determined solely by the relevant measure of object-to-cluster
similarity, which in the present work is the relative
entropy between object and cluster centroid distributions.
Since no other information is available, the
membership is determined by maximizing the configuration
entropy for a fixed average distortion.
With the maximum entropy (ME) membership distribution,
ML estimation is equivalent to the minimization
of the average distortion of the data. The combined entropy
maximization and distortion minimization
is carried out by a two-stage iterative process similar
to the EM method (Dempster et al., 1977). The
first stage of an iteration is a maximum likelihood, or
minimum distortion, estimation of the cluster centroids
given fixed membership probabilities. In the second
stage, the entropy of the membership distribution
is maximized for a fixed average distortion.
This joint optimization searches for a saddle point in
the distortion-entropy parameters, which is equivalent
to minimizing a linear combination of the two known
as free energy in statistical mechanics. This analogy
with statistical mechanics is not coincidental, and provides
us with a better understanding of the clustering
procedure.

Maximum Likelihood Cluster Centroids. For the
maximum likelihood argument, we start by estimating
the likelihood of the sequence $S$ of $N$ independent observations
of pairs $(n_i, v_i)$. Using (1), the sequence's
model log likelihood is

$$l(S) = \log p(S) = \sum_{i=1}^{N} \log \sum_{c \in C} p(c)\, p(n_i|c)\, p(v_i|c)$$
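The log likelihood of a sample under the mixture model (1) can be evaluated as in the following sketch. The function name `log_likelihood` and the dictionary layout of the parameters are assumptions made for illustration; natural logarithms are used.

```python
import math

def log_likelihood(pairs, p_c, p_n_given_c, p_v_given_c):
    """l(S) = sum_i log sum_c p(c) p(n_i|c) p(v_i|c).

    pairs is a list of (noun, verb) observations; p_c maps cluster -> p(c);
    p_n_given_c[c] and p_v_given_c[c] map nouns and verbs to p(n|c) and p(v|c).
    A pair with zero model probability makes math.log raise ValueError.
    """
    total = 0.0
    for n, v in pairs:
        mass = sum(
            p_c[c] * p_n_given_c[c].get(n, 0.0) * p_v_given_c[c].get(v, 0.0)
            for c in p_c
        )
        total += math.log(mass)
    return total
```

With a single cluster that spreads its noun mass evenly over two nouns and puts all verb mass on one verb, two observations each receive probability 1/2, so the log likelihood is $2 \log \tfrac{1}{2}$.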



Fixing the number of clusters (model size) $|C|$, we
want to maximize $l(S)$ with respect to the distributions
$p(n|c)$ and $p(v|c)$. The variation of $l(S)$ with respect to
these distributions is

$$\delta l(S) = \sum_{i=1}^{N} \frac{1}{\hat{p}(n_i, v_i)} \sum_{c \in C} p(c) \left[ p(v_i|c)\, \delta p(n_i|c) + p(n_i|c)\, \delta p(v_i|c) \right] \tag{2}$$



with $p(n|c)$ and $p(v|c)$ kept normalized. Using Bayes's
formula, we have [2]

$$\frac{1}{\hat{p}(n_i, v_i)} = \frac{p(c|n_i, v_i)}{p(c)\, p(n_i|c)\, p(v_i|c)}$$

for any $c$, which we substitute into (2) to obtain

$$\delta l(S) = \sum_{i=1}^{N} \sum_{c \in C} p(c|n_i, v_i) \left[ \delta \log p(n_i|c) + \delta \log p(v_i|c) \right] \tag{3}$$

since $\delta \log p = \delta p / p$. This expression is particularly
useful when the cluster distributions $p(n|c)$ and $p(v|c)$
are of exponential form, precisely what will be provided
by the ME step described below.

[2] As usual in clustering models (Duda and Hart, 1973),
we assume that the model distribution and the empirical
distribution are interchangeable at the solution of the parameter
estimation equations, since the model is assumed
to be able to represent correctly the data at that solution
point. In practice, the data may not come exactly from the
chosen model class, but the model obtained by solving the
estimation equations may still be the closest one to the data.

At this point we need to specify the clustering model
in more detail. In the derivation so far we have treated
$p(n|c)$ and $p(v|c)$ symmetrically, corresponding to clusters
not of verbs or nouns but of verb-noun associations.
In principle such a symmetric model may be more accurate,
but in this paper we will concentrate on asymmetric
models in which cluster memberships are associated
to just one of the components of the joint distribution
and the cluster centroids are specified only by the other
component. In particular, the model we use in our experiments
has noun clusters with cluster memberships
determined by $p(n|c)$ and centroid distributions determined
by $p(v|c)$.

The asymmetric model simplifies the estimation significantly
by dealing with a single component, but it has
the disadvantage that the joint distribution $p(n, v)$ has
two different and not necessarily consistent expressions
in terms of asymmetric models for the two coordinates.

Maximum Entropy Cluster Membership. While
variations of $p(n|c)$ and $p(v|c)$ in equation (3) are not
independent, we can treat them separately. First, for
fixed average distortion between the cluster centroid
distributions $p(v|c)$ and the data $p(v|n)$, we find the
cluster membership probabilities, which are the Bayes's
inverses of the $p(n|c)$, that maximize the entropy of the
cluster distributions. With the membership distributions
thus obtained, we then look for the $p(v|c)$ that
maximize the log likelihood $l(S)$. It turns out that these
will also be the values of $p(v|c)$ that minimize the average
distortion between the asymmetric cluster model
and the data.

Given any similarity measure $d(n, c)$ between nouns
and cluster centroids, the average cluster distortion is

$$\langle D \rangle = \sum_{n \in \mathcal{N}} \sum_{c \in C} p(c|n)\, d(n, c) \tag{4}$$

If we maximize the cluster membership entropy

$$H = -\sum_{n \in \mathcal{N}} \sum_{c \in C} p(c|n) \log p(n|c) \tag{5}$$

subject to normalization of $p(n|c)$ and fixed (4), we obtain
the following standard exponential forms for the
class and membership distributions

$$p(n|c) = \frac{1}{Z_c} \exp\left(-\beta d(n, c)\right) \tag{6}$$

$$p(c|n) = \frac{1}{Z_n} \exp\left(-\beta d(n, c)\right) \tag{7}$$

where the normalization sums (partition functions) are
$Z_c = \sum_n \exp(-\beta d(n, c))$ and $Z_n = \sum_c \exp(-\beta d(n, c))$.



Notice that d(n, c) does not need to be symmetric for 
this derivation, as the two distributions are simply re- 
lated by Bayes's rule. 
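The exponential membership form (7) amounts to a normalized exponential of the negated, scaled distortions. The sketch below shows this for a single noun; the name `membership` and the input layout (a dictionary from cluster to $d(n, c)$) are assumptions for illustration.

```python
import math

def membership(distortions, beta):
    """p(c|n) = exp(-beta * d(n, c)) / Z_n for one noun n.

    distortions maps each cluster c to d(n, c).
    """
    # Subtracting the smallest distortion before exponentiating avoids
    # underflow; the shift cancels between numerator and Z_n.
    d_min = min(distortions.values())
    weights = {c: math.exp(-beta * (d - d_min)) for c, d in distortions.items()}
    z_n = sum(weights.values())
    return {c: w / z_n for c, w in weights.items()}
```

At `beta = 0` every cluster receives the same membership, while for large `beta` almost all of the mass goes to the cluster with the lowest distortion.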

Returning to the log-likelihood variation (3), we can
now use (6) for $p(n|c)$ and the assumption for the asymmetric
model that the cluster membership stays fixed
as we adjust the centroids, to obtain

$$\delta l(S) = -\sum_{i=1}^{N} \sum_{c \in C} p(c|n_i) \left[ \beta\, \delta d(n_i, c) + \delta \log Z_c \right] \tag{8}$$

where the variation of $p(v|c)$ is now included in the
variation of $d(n, c)$.

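For a large enough sample, we may replace the sum
over observations in (8) by the average over $\mathcal{N}$

$$\delta l(S) = -\sum_{n \in \mathcal{N}} p(n) \sum_{c \in C} p(c|n) \left[ \beta\, \delta d(n, c) + \delta \log Z_c \right]$$

which, applying Bayes's rule, becomes

$$\delta l(S) = -\sum_{c \in C} p(c) \sum_{n \in \mathcal{N}} p(n|c) \left[ \beta\, \delta d(n, c) + \delta \log Z_c \right] \tag{9}$$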



At the log-likelihood maximum, the variation (9) must
vanish. We will see below that the use of relative entropy
for similarity measure makes $\delta \log Z_c$ vanish at
the maximum as well, so the log likelihood can be maximized
by minimizing the average distortion with respect
to the class centroids while class membership is
kept fixed

$$\sum_{c \in C} p(c) \sum_{n \in \mathcal{N}} p(n|c)\, \delta d(n, c) = 0 ,$$

or, sufficiently, if each of the inner sums vanishes

$$\sum_{c \in C} \sum_{n \in \mathcal{N}} p(n|c)\, \delta d(n, c) = 0 \tag{10}$$



Minimizing the Average KL Distortion. We first
show that the minimization of the relative entropy
yields the natural expression for cluster centroids

$$p(v|c) = \sum_{n \in \mathcal{N}} p(n|c)\, p(v|n) \tag{11}$$



To minimize the average distortion (10), we observe
that the variation of the KL distance between noun
and centroid distributions with respect to the centroid
distribution $p(v|c)$, with each centroid distribution normalized
by the Lagrange multiplier $\lambda_c$, is given by

$$\delta d(n, c) = \delta \left( -\sum_{v \in \mathcal{V}} p(v|n) \log p(v|c) + \lambda_c \Big( \sum_{v \in \mathcal{V}} p(v|c) - 1 \Big) \right) = \sum_{v \in \mathcal{V}} \left( -\frac{p(v|n)}{p(v|c)} + \lambda_c \right) \delta p(v|c)$$

Substituting this expression into (10), we obtain

$$\sum_{c \in C} \sum_{n \in \mathcal{N}} \sum_{v \in \mathcal{V}} p(n|c) \left( -\frac{p(v|n)}{p(v|c)} + \lambda_c \right) \delta p(v|c) = 0$$

Since the $\delta p(v|c)$ are now independent, we obtain immediately
the desired centroid expression (11), which is
the desired weighted average of noun distributions.
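The centroid expression (11) is a weighted average of the noun distributions, which can be sketched as follows. The name `centroid` and the toy distributions are hypothetical; the weights are assumed to be the $p(n|c)$ for one cluster and to sum to one.

```python
def centroid(noun_dists, weights):
    """p(v|c) = sum_n p(n|c) p(v|n) for a single cluster c.

    noun_dists maps noun -> {verb: p(v|n)}; weights maps noun -> p(n|c).
    """
    out = {}
    for n, w in weights.items():
        for v, pv in noun_dists[n].items():
            out[v] = out.get(v, 0.0) + w * pv
    return out

# Hypothetical toy distributions over two verbs.
noun_dists = {
    "gun": {"fire": 0.75, "load": 0.25},
    "rifle": {"fire": 0.5, "load": 0.5},
}
p_c = centroid(noun_dists, {"gun": 0.5, "rifle": 0.5})
```

Because the centroid is an average with positive weights, it is nonzero wherever any contributing noun distribution is nonzero, so the KL distance from each such noun to the centroid stays finite.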

We can now see that the variation $\delta \log Z_c$ vanishes
for centroid distributions given by (11), since it follows
from (10) that

$$\delta \log Z_c = -\frac{\beta}{Z_c} \sum_{n} \exp\left(-\beta d(n, c)\right) \delta d(n, c) = -\beta \sum_{n} p(n|c)\, \delta d(n, c) = 0 .$$

The Free Energy Function. The combined minimum
distortion and maximum entropy optimization is
equivalent to the minimization of a single function, the
free energy

$$F = \langle D \rangle - H / \beta$$

where $\langle D \rangle$ is the average distortion (4) and $H$ is the
cluster membership entropy (5).

The free energy determines both the distortion and
the membership entropy through

$$\langle D \rangle = \frac{\partial (\beta F)}{\partial \beta}, \qquad H = -\frac{\partial F}{\partial T}$$

with temperature $T = \beta^{-1}$.

The most important property of the free energy is
that its minimum determines the balance between the
"disordering" maximum entropy and "ordering" distortion
minimization in which the system is most likely to
be found. In fact the probability to find the system at
a given configuration is exponential in $F$

$$P \propto \exp(-\beta F) ,$$

so a system is most likely to be found in its minimal 
free energy configuration. 

Hierarchical Clustering 

The analogy with statistical mechanics suggests a deterministic
annealing procedure for clustering (Rose et
al., 1990), in which the number of clusters is determined
through a sequence of phase transitions by continuously
increasing the parameter $\beta$ following an annealing
schedule.
Figure 1: Direct object clusters for fire



The higher $\beta$ is, the more local is the influence of each
noun on the definition of centroids. The dissimilarity
plays here the role of distortion. When the scale parameter
$\beta$ is close to zero, the dissimilarities are almost
irrelevant, all words contribute about equally to each
centroid, and so the lowest average distortion solution
involves just one cluster which is the average of all word
densities. As $\beta$ is slowly increased, a point (phase transition)
is eventually reached at which the natural solution
involves two distinct centroids. We say then that the
original cluster has split into the two new clusters.

In general, if we take any cluster $c$ and a twin $c'$ of
$c$ such that the centroid $p_{c'}$ is a small random perturbation
of $p_c$, below the critical $\beta$ at which $c$ splits the
membership and centroid reestimation procedure given
by equations (7) and (11) will make $p_c$ and $p_{c'}$ converge,
that is, $c$ and $c'$ are really the same cluster. But with
$\beta$ above the critical value for $c$, the two centroids will
diverge, giving rise to two daughters of $c$.

Our clustering procedure is thus as follows. We start
with very low $\beta$ and a single cluster whose centroid is
the average of all noun distributions. For any given
$\beta$, we have a current set of leaf clusters corresponding
to the current free energy (local) minimum. To refine
such a solution, we search for the lowest $\beta$ which is the
critical value at which some current leaf cluster splits. Ideally,
there is just one split at that critical value, but for
practical performance and numerical accuracy reasons
we may have several splits at the new critical point. The
splitting procedure can then be repeated to achieve the
desired number of clusters or model cross-entropy.
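The procedure just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in our experiments: it steps through a fixed list of $\beta$ values instead of searching for critical values, assumes a uniform $p(n)$, runs a fixed number of reestimation iterations, and merges twins whose centroids end up within a small KL tolerance. All function names, parameter values and the toy data are hypothetical.

```python
import math
import random

def kl(p, q):
    """D(p || q) in bits; infinite if q is zero where p is not."""
    total = 0.0
    for v, pv in p.items():
        if pv > 0.0:
            qv = q.get(v, 0.0)
            if qv == 0.0:
                return math.inf
            total += pv * math.log2(pv / qv)
    return total

def memberships(p_n, centroids, beta):
    """p(c|n) proportional to exp(-beta * D(p_n || p_c)), as in (7)."""
    d = [kl(p_n, c) for c in centroids]
    d_min = min(d)
    w = [math.exp(-beta * (x - d_min)) for x in d]
    z = sum(w)
    return [x / z for x in w]

def reestimate(dists, centroids, beta, iters=50):
    """Alternate the membership update (7) and the centroid update (11)."""
    for _ in range(iters):
        member = {n: memberships(p, centroids, beta) for n, p in dists.items()}
        new = []
        for k, old in enumerate(centroids):
            z_c = sum(member[n][k] for n in dists)  # uniform p(n) assumed
            if z_c == 0.0:
                new.append(old)  # empty cluster: keep its previous centroid
                continue
            cen = {}
            for n, p in dists.items():
                for v, pv in p.items():
                    cen[v] = cen.get(v, 0.0) + member[n][k] / z_c * pv
            new.append(cen)
        centroids = new
    return centroids

def anneal(dists, betas, eps=0.01, tol=1e-6, seed=0):
    """Return the leaf centroids after stepping through the given betas."""
    rng = random.Random(seed)
    # Start with one cluster: the average of all noun distributions.
    root = {}
    for p in dists.values():
        for v, pv in p.items():
            root[v] = root.get(v, 0.0) + pv / len(dists)
    centroids = [root]
    for beta in betas:
        # Give every current leaf a slightly perturbed, renormalized twin.
        twins = []
        for c in centroids:
            t = {v: pv * (1.0 + eps * rng.uniform(-1.0, 1.0)) for v, pv in c.items()}
            z = sum(t.values())
            twins.append({v: pv / z for v, pv in t.items()})
        centroids = reestimate(dists, centroids + twins, beta)
        # Twins that converged back together count as a single cluster.
        merged = []
        for c in centroids:
            if all(kl(c, m) > tol for m in merged):
                merged.append(c)
        centroids = merged
    return centroids

# Hypothetical toy data: two groups of nouns with different verb preferences.
toy = {
    "gun": {"a": 0.8, "b": 0.1, "c": 0.1},
    "rifle": {"a": 0.7, "b": 0.1, "c": 0.2},
    "employee": {"a": 0.1, "b": 0.8, "c": 0.1},
    "worker": {"a": 0.1, "b": 0.7, "c": 0.2},
}
clusters = anneal(toy, betas=[0.1, 20.0])
```

On this toy data the twins collapse back into one cluster at the low $\beta$ and separate into two clusters at the high $\beta$, mirroring the splitting behavior described above. A fuller version would locate each critical $\beta$ rather than rely on a preset schedule.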



CLUSTERING EXAMPLES 

All our experiments involve the asymmetric model described
in the previous section. As explained there, our
clustering procedure yields for each value of $\beta$ a set
$C_\beta$ of clusters minimizing the free energy $F$, and the
asymmetric model for $\beta$ estimates the conditional verb
distribution for a noun $n$ by

$$\hat{p}_n = \sum_{c \in C_\beta} p(c|n)\, p_c$$

where $p(c|n)$ also depends on $\beta$.

As a first experiment, we used our method to clas- 
sify the 64 nouns appearing most frequently as heads 
of direct objects of the verb "fire" in one year (1988) of 
Associated Press newswire. In this corpus, the chosen 
nouns appear as direct object heads of a total of 2147 
distinct verbs, so each noun is represented by a density 
over the 2147 verbs. 

Figure 1 shows the five words most similar to each
cluster centroid for the four clusters resulting from the
first two cluster splits. It can be seen that the first split
separates the objects corresponding to the weaponry 
sense of "fire" (cluster 1) from the ones corresponding 
to the personnel action (cluster 2). The second split 
then further refines the weaponry sense into a projectile 
sense (cluster 3) and a gun sense (cluster 4). That split 
is somewhat less sharp, possibly because not enough 
distinguishing contexts occur in the corpus. 

Figure 2 shows the four closest nouns to the centroid
of each of a set of hierarchical clusters derived
from verb-object pairs involving the 1000 most frequent
nouns in the June 1991 electronic version of Grolier's
Encyclopedia (10 million words).

MODEL EVALUATION 

The preceding qualitative discussion provides some in- 
dication of what aspects of distributional relationships 
may be discovered by clustering. However, we also need 
to evaluate clustering more rigorously as a basis for 
models of distributional relationships. So far, we have
looked at two kinds of measurements of model quality:
(i) relative entropy between held-out data and the
asymmetric model, and (ii) performance on the task
of deciding which of two verbs is more likely to take
a given noun as direct object when the data relating
one of the verbs to the noun has been withheld from the
training data. 

The evaluation described below was performed on 
the largest data set we have worked with so far, ex- 
tracted from 44 million words of 1988 Associated Press 
newswire with the pattern matching techniques men- 
tioned earlier. This collection process yielded 1112041 
verb-object pairs. We then selected the subset involving
the 1000 most frequent nouns in the corpus for clustering,
and randomly divided it into a training set of
756721 pairs and a test set of 81240 pairs.

Figure 3: Asymmetric Model Evaluation, AP88 Verb-Direct Object Pairs
(horizontal axis: number of clusters; curves: train, test, new)

Relative Entropy 

Figure 3 plots the average relative entropy of several
data sets to asymmetric clustered models of different
sizes, given by

$$\sum_{n} D(t_n \,\|\, \hat{p}_n)$$

where $t_n$ is the relative frequency distribution of verbs
taking $n$ as direct object in the test set. For each critical
value of $\beta$, we show the relative entropy with respect
to the asymmetric model based on $C_\beta$ of the training
set (set train), of randomly selected held-out test set
(set test), and of held-out data for a further 1000 nouns
that were not clustered (set new). Unsurprisingly, the
training set relative entropy decreases monotonically. 
The test set relative entropy decreases to a minimum 
at 206 clusters, and then starts increasing, suggesting 
that larger models are overtrained. 
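The relative entropy measurement can be sketched as below. The function names and input layout are assumptions for illustration, and the sketch divides the sum over nouns by the number of nouns to report an average, which is one possible reading of the normalization. Memberships are given as lists aligned with the list of centroids.

```python
import math

def kl(p, q):
    """D(p || q) in bits; infinite if q is zero where p is not."""
    total = 0.0
    for v, pv in p.items():
        if pv > 0.0:
            qv = q.get(v, 0.0)
            if qv == 0.0:
                return math.inf
            total += pv * math.log2(pv / qv)
    return total

def model_distribution(member, centroids):
    """Asymmetric model estimate: sum_c p(c|n) p_c(v)."""
    out = {}
    for w, cen in zip(member, centroids):
        for v, pv in cen.items():
            out[v] = out.get(v, 0.0) + w * pv
    return out

def average_relative_entropy(test_dists, members, centroids):
    """Mean over nouns n of D(t_n || model estimate for n)."""
    total = 0.0
    for n, t_n in test_dists.items():
        total += kl(t_n, model_distribution(members[n], centroids))
    return total / len(test_dists)
```

Evaluating this quantity on training data and on held-out data for each model size gives curves of the kind compared in this section.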

The new noun test set is intended to test whether 
clusters based on the 1000 most frequent nouns are use- 
ful classifiers for the selectional properties of nouns in 
general. As the figure shows, the cluster model provides 
over one bit of information about the selectional prop- 
erties of the new nouns, but the overtraining effect is 
even sharper than for the held-out data involving the 
1000 clustered nouns. 

Decision Task 

We also evaluated asymmetric cluster models on a verb
decision task closer to possible applications to disambiguation
in language analysis. The task consists of judging
which of two verbs $v$ and $v'$ is more likely to take a
given noun $n$ as object, when all occurrences of $(v, n)$
in the training set were deliberately deleted. Thus this
test evaluates how well the models reconstruct missing
data in the verb distribution for $n$ from the cluster centroids
close to $n$.

The data for this test was built from the training
data for the previous one in the following way, based on
a suggestion by Dagan et al. (1992). A small number
(104) of $(v, n)$ pairs with a fairly frequent verb (between
500 and 5000 occurrences) was randomly picked, and all
occurrences of each pair in the training set were deleted.
The resulting training set was used to build a sequence
of cluster models as before. Each model was used to
decide which of two verbs $v$ and $v'$ is more likely to
appear with a noun $n$ where the $(v, n)$ data was deleted
from the training set, and the decisions compared with
the corresponding ones derived from the original event
frequencies in the initial data set. More specifically, for
each deleted pair $(v, n)$ and each verb $v'$ that occurred
with $n$ in the initial data either at least twice as frequently
or at most half as frequently as $v$, we compared
the sign of $\log \hat{p}_n(v)/\hat{p}_n(v')$ with that of $\log p_n(v)/p_n(v')$
for the initial data set. The error rate for each model
is simply the proportion of sign disagreements in the
selected $(v, n, v')$ triples. Figure 4 shows the error rates
for each model for all the selected $(v, n, v')$ (all) and for
just those exceptional triples in which the log frequency
ratio of $(n, v)$ and $(n, v')$ differs from the log marginal
frequency ratio of $v$ and $v'$. In other words, the exceptional
cases are those in which predictions based just on
the marginal frequencies, which the initial one-cluster
model represents, would be consistently wrong.
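The error rate on this task is the fraction of triples whose log-ratio signs disagree, which can be sketched as follows. The names `decision_error_rate`, `model` and `reference` are hypothetical, the toy numbers are invented for illustration, and both tables are assumed to give positive probabilities for every verb being compared.

```python
import math

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def decision_error_rate(triples, model, reference):
    """Proportion of (v, n, v') triples on which model and reference disagree.

    model[n][v] is the cluster model estimate for noun n and verb v after
    deletion; reference[n][v] is the relative frequency in the initial data.
    """
    errors = 0
    for v, n, v2 in triples:
        s_model = sign(math.log(model[n][v] / model[n][v2]))
        s_ref = sign(math.log(reference[n][v] / reference[n][v2]))
        errors += s_model != s_ref
    return errors / len(triples)

# Hypothetical toy tables; not results from our experiments.
model = {"gun": {"fire": 0.4, "load": 0.6},
         "employee": {"fire": 0.7, "hire": 0.3}}
reference = {"gun": {"fire": 0.8, "load": 0.2},
             "employee": {"fire": 0.75, "hire": 0.25}}
triples = [("fire", "gun", "load"), ("fire", "employee", "hire")]
rate = decision_error_rate(triples, model, reference)
```

In this toy case the model reverses the preference for the first triple and agrees on the second, so half of the decisions are errors.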

Here too we see some overtraining for the largest 
models considered, although not for the exceptional 
verbs. 



CONCLUSIONS 

We have demonstrated that a general divisive cluster- 
ing procedure for probability distributions can be used 
to group words according to their participation in par- 
ticular grammatical relations with other words. The re- 
sulting clusters are intuitively informative, and can be 
used to construct class-based word cooccurrence models
with substantial predictive power. 

While the clusters derived by the proposed method 
seem in many cases semantically significant, this intu- 
ition needs to be grounded in a more rigorous assess- 
ment. In addition to predictive power evaluations of 
the kind we have already carried out, it might be worth 
comparing automatically-derived clusters with human 
judgements in a suitable experimental setting. 

Moving further in the direction of class-based language
models, we plan to consider additional distributional
relations (for instance, adjective-noun) and apply
the results of clustering to the grouping of lexical
associations in lexicalized grammar frameworks such
as stochastic lexicalized tree-adjoining grammars (Schabes,
1992).